\documentclass{article}
\usepackage{algorithmicx,algorithm,algpseudocode,booktabs,url,amstext}

\renewcommand{\cite}[1]{}


\begin{document}

\begin{abstract}
In this paper, we investigate the potential of GPUs for performing link structure analysis of social graphs.  Specifically, we implement Twitter's WTF (``Who to Follow'') recommendation system on a single GPU\@. Our implementation shows promising results on moderate-sized social graphs. It can return the top-K relevant users for a single user in 172~ms when running on a subset of the 2009 Twitter follow graph with 16 million users and 85 million social relations. For our largest dataset, which contains 75\% of the users (30 million) and 50\% of the social relations (680 million) of the complete follow graph, this calculation takes 1.0~s. We also propose possible solutions to apply our system to follow graphs of larger sizes that do not fit into the on-board memory of a single GPU\@.
\end{abstract}


\section{Introduction}

Many web service providers use recommendation systems to sell products or to increase the value of their service. Shopping services like Amazon suggest products users may be interested in buying, news sites recommend articles based on a user's reading history, and streaming services like Netflix recommend movies and television shows to watch. Social networking services recommend people that the current user may want to connect with. In a social network, where the content is entirely user-generated, a good recommendation system is key in retaining and engaging users.

Twitter is a social media service that allows users to broadcast short, 140-character \emph{tweets} to all users who choose to \emph{follow} them. Twitter's success depends on acquiring and maintaining an active user base. A user's satisfaction with the service depends almost entirely on the content generated by the other users in their social network. They need to subscribe to the tweets of users they are interested in reading updates from. Finding users to follow can be difficult, especially for new users, so Twitter provides them with recommendations of accounts they may be interested in subscribing to. This service is called ``Who to Follow'', or WTF\@. Twitter calculates these recommendations by analyzing the link structure of the follow graph~\cite{Gupta:2013:WTW}. Recommendations should be provided and updated in real time to keep up with changes in the follower graph. For a graph with hundreds of millions of nodes and billions of edges, this is no small problem. Graphics processing units (GPUs) have a highly parallel architecture that is potentially well-suited for this kind of large-scale graph analysis.

\begin{figure*}
  \centering
  \caption{Overview of Twitter's WTF algorithm. \emph{Frame 1:} The initial graph (red [dark] node is the user for whom recommendations are being computed). \emph{Frame 2:} The Circle of Trust (nodes in pink [dark]) is found using Personalized PageRank. \emph{Frame 3:} The graph is pruned to include only the CoT and the users they follow. \emph{Frame 4:} The relevance scores of all users on the right side are computed with Twitter's Money algorithm. Node 4 will be suggested for Node 0 to follow because it has the highest value of all nodes the user is not already following.}
\end{figure*}

GPUs are throughput-oriented, highly parallel architectures that have proven to be very efficient for solving many large-scale problems with plenty of available parallelism. The modern GPU has become the most cost-effective way to perform massive amounts of computation. A single GPU can often outperform a cluster of CPUs; however, they have not yet been able to gain a foothold in data centers. This is due to various issues, such as programmability, limited memory size, and communication costs. GPUs have a very different architecture from CPUs. This makes GPUs efficient at solving large problems, but also creates unique programming challenges.

Solving graph problems on a GPU is particularly difficult because although these problems usually have a large amount of parallelism, that parallelism is irregular: in social graphs, nodes typically have widely varying connectivity and thus wildly varying work per node. As well, graph algorithms usually involve traversing the graph, including paths that split and join along the way; this complicates the control flow, because paths could intersect at any time. Nonetheless, some work has been done on implementing PageRank, a popular graph analysis algorithm, on GPUs~\cite{Rungsawang:2012:FPC,Wu:2010:EPS}; however, this work does not confront the irregular challenges of graph traversal because they use the linear algebra formulation of the PageRank algorithm. GPUs can work well on native graph representations: Merrill et al.\ used GPUs to achieve 4x speedups over multicore CPUs with breadth-first search~\cite{Merrill:2012:SGG}. Our work leverages some of Merrill et al.'s traversal strategies, but maps a more complex graph algorithm to GPUs. Very little work has been done using GPUs for recommendation systems. Srinivasa et al.\ implemented the friends-of-friends algorithm on a GPU~\cite{Srinivasa:2013:GIO}. Friends-of-friends is a basic recommendation algorithm that takes a user and nodes adjacent to the user and returns the set of second level nodes that are adjacent to these nodes. Srinivasa et al.\ only tested their algorithm on small graphs (fewer than 40,000 nodes), but they did achieve a speedup of about 2x over a basic CPU implementation.

\section{Twitter's WTF Algorithm}

Two main stages comprise the WTF recommender. In the first stage, Twitter calculates a Personalized PageRank (PPR) for the user. PPR assigns a ranking to all nodes in the network based on how closely connected they are to the main node (the user interested in recommendations). This ranking is used to find the top 1000 ranking nodes. These nodes form the user's ``Circle of Trust'' (CoT), which consists of the 1000 nodes closest to the user. Pruning the graph in this way increases the personalization of the recommendation and reduces spam. Next, they create a bipartite graph with the CoT on one side, and the users followed by the CoT on the other. All other nodes are pruned from the graph. The final step is Twitter's ``Money'' algorithm, a graph analysis algorithm that determines which accounts the user is most likely to be interested in following. Figure 1 shows a schematic of the entire WTF algorithm.

\subsection{Personalized PageRank}
PageRank is an algorithm for ranking nodes in a graph based on the structure of the graph~\cite{Page:1999:TPC}. It was originally used on the web graph to rank webpages for search engines. The PageRank of a node can be thought of as the probability that a random walker traversing the graph along the edges will land on that node. It is a measure of how well-connected the node is, and therefore, how important it is to the graph. A Personalized PageRank (PPR) calculation relative to node $A$ is identical to the normal PageRank calculation, except all random walks begin at node $A$, rather than a random node. Overall, a Personalized PageRank calculation for $A$ shows which nodes are most closely related to $A$\@.

PageRank can be calculated using Monte Carlo methods or power iteration. A Monte Carlo method for Personalized PageRank would be to actually perform many random walks on the graph and maintain a count of the number of times each node is visited, then use these counts to estimate the stationary distribution. Power iteration methods formulate the problem as a system of linear equations and use linear algebra techniques to solve for the ranking values. Twitter chose a Monte Carlo method for their PPR, while we chose a power iteration method for our implementation. We discuss the reasoning for this decision in Section~\ref{sec:discussion}.

\subsection{Money}
Twitter's Money algorithm~\cite{Goel:2014:TWT} is similar to a combination of Kleinberg's HITS algorithm~\cite{Kleinberg:1999:ASH} and Lempel and Moran's SALSA~\cite{Lempel:2001:STS}. First, the graph is transformed into a bipartite graph, with the CoT on the left side, and the users the CoT follows on the right side, as shown in the second and third stages of Figure 1. If a user is both in the CoT and followed by someone in the CoT, this node will appear on both sides of the bipartite graph. The CoT nodes are assigned a \emph{similarity} value, and the users they follow are assigned a \emph{relevance} value. Initially, the user we are trying to get recommendations for, $C$, has their similarity score set to 1, and all others have their similarity or relevance scores set to 0. The Money algorithm distributes each CoT member's similarity score to the relevance scores of all the users they follow. The followers then distribute their relevance scores back across the graph to all of their followers. As in PPR, the Money algorithm can also be written as a system of linear equations. From the solution of this system of equations, Twitter finds the nodes with the highest relevance scores. These are the accounts they recommend to the user in the Who to Follow feature. The similarity scores are used for other features, such as the ``similar to you'' feature and targeted advertising.

\section{Parallelizing Graph Algorithms on the GPU}

Both PPR and Money are link prediction algorithms that traverse the graph and assign rank values for a subset of nodes. Like many other graph algorithms, they can be viewed as iterative convergent processes. There is a data dependency between iterative steps, but within each step, a large number of independent edge-centric or vertex-centric operations can be parallelized.  However, to fully exploit the compute capabilities of GPUs, we need special strategies to handle irregular memory access and work distribution.

For compact and efficient memory access, we use the compressed sparse row (CSR) format to represent the follow graph on the GPU\@. It uses a column-indices array, $C$, to store a list of neighbor vertices and a row-offsets array, $R$, to store the offset of the neighbor list for each vertex. This representation enables us to use parallel primitives such as prefix sum to reorganize sparse and uneven workloads into dense and uniform ones in all phases of graph processing.

\subsection{Graph Primitives}
To solve the issue of irregular work distribution, we design two graph primitives: \emph{graph traversal} and \emph{filtering}. Graph traversal starts with a queue of vertices we call the \emph{frontier}. Traversal then takes several iterations to advance toward other vertices by visiting the neighbor list of all vertices in the current frontier in parallel and adding the neighbor vertices to the new frontier.  Filtering starts with a frontier that contains either edges or vertices, does a user-defined validation test for every item in the frontier in parallel, then generates a new frontier containing only the items that have passed the validation test.

In PPR, graph traversal and filtering alternate until all the rank values converge and there are no vertices in the frontier for the graph traversal primitive. In Money, we use the graph traversal primitive to bounce back and forth between two disjoint sets in a bipartite graph: the CoT and the union of all CoT vertices' neighbor lists. Since in graph traversal primitives, the neighbor lists can vary greatly in size, traversing these neighbor lists in parallel efficiently and in a load-balanced way is critical for the performance of our system.  Merrill et al.~\cite{Merrill:2012:SGG} uses specialized strategies according to neighbor list size in the context of a parallel breadth-first search (BFS) algorithm. The algorithm efficiently reduces the amount of overhead within each kernel and better utilizes the GPU\@. Our implementation extends this strategy with two important improvements. First, we add inline device functions to perform user-specific computations and to reuse graph traversal primitives for both PPR and Money by just replacing the function we load when visiting edges and vertices. Second, BFS's operations on vertices are idempotent, but ours are not, so we guarantee the correctness of both algorithms by adding atomics to resolve race conditions between edges converging on a vertex.

\subsection{Application to Other Graph Algorithms}
The graph traversal primitive distributes the parallel workload per edge or per node in a graph to tens of thousands of threads on the GPU to process in parallel, and the filtering primitive reorganizes the elements we want to process in parallel for the next iteration. By reusing these two primitives, we have implemented several graph-traversal-based ranking algorithms such as PageRank, PPR, Money, HITS, and SALSA with minimal programming cost. Several of these algorithms run on bipartite graphs, which are key abstractions in several ranking and link prediction algorithms. As far as we know, we are the first to target bipartite graph algorithms on the GPU\@.

We implement bipartite graphs in the following way. A directed bipartite graph is similar to a normal undirected graph, except in a directed bipartite graph, we need to consider the outgoing edges and the incoming edges of a vertex separately. We achieve this by reversing the source and destination vertex ID for each edge while constructing the CSR data structure to record the incoming edges and using an additional prefix sum pass to compute the incoming degree for each vertex. In our graph primitives, we also added a feature to switch between visiting the outgoing edges of a vertex and visiting the incoming edges of a vertex. We use this method in our SALSA implementations. In our Money and HITS implementation, we do not need to calculate the in-degree of every node in the graph, but just the ones connected to the CoT\@. We take advantage of this by finding the incoming degree values for neighbors of the CoT on the fly with an additional pass of the graph traversal primitive. Because of the small size of the CoT, this extra pass takes negligible time but saves us gigabytes of memory for large datasets.

\section{WTF Implementation}
In our implementation, we start by putting all the graph topology information on the GPU\@. First, we compute the PPR value for each vertex in the follow graph with a user-defined seed vertex. Then we radix-sort the PPR values and take the vertices with the top $k$ PPR values (in our implementation, $k=1000$) as the CoT and put them in a frontier. We then run the Money algorithm, sort the vertices that the CoT follows, and finally extract the vertices with the top relevance scores to use for our recommendation.

\subsection{Personalized PageRank} Algorithm~\ref{alg:ppr} shows our implementation of PPR using the graph primitives and functors we design. We update PPR using the following equation:
\begin{equation}
  \textit{PPR}(v_i) = \left\{
    \begin{array}{l l}
      (1-\delta) + \delta \cdot \!\!\!\!\!\sum\limits_{v_j \in \textit{pred}(v_i)} \!\!\frac{\textit{PPR}(v_j)}{d_\text{OUT}(v_j)} & \text{$v_i$ is seed}\\
      \delta \cdot \sum\limits_{v_j \in \textit{pred}(v_i)} \frac{\textit{PPR}(v_j)}{d_\text{OUT}(v_j)} & \text{otherwise}
    \end{array}
  \right.
\end{equation}
where $\delta$ is a constant damping factor (typically set to 0.85), $d_\text{OUT}(v_j)$ is the number of outbound edges on vertex $v_j$, and $N$ is the total number of vertices in the graph.

The PPR algorithm starts by initializing problem data (line~\ref{alg:ppr:problemdata}). It assigns the initial rank value for each vertex as $\frac{1}{N}$, and puts all vertices in the initial frontier. The main loop of the algorithm contains two steps. For each iteration, we first use GraphTraversal to visit all the edges for each vertex in the frontier and distribute each vertex's PPR value to its neighbors (line~\ref{alg:ppr:distribute}). Then we use Filtering to update the PPR value for the vertices that still have unconverged PPR values. We give the seed vertex an extra value (line~\ref{alg:ppr:seed}) as in the equation. Then we remove the vertices whose PPR values have converged from the frontier (line~\ref{alg:ppr:filter}). The algorithm ends when all the PPR values converge and the frontier is empty.

\begin{algorithm}
\caption{Personalized PageRank} \label{alg:ppr}
  \begin{algorithmic}[1]
    \Procedure{Set\_Problem\_Data}{$G, P, \textit{seed}, \textit{delta}$} \label{alg:ppr:problemdata}
      \State $P.\textit{ranks\_curr}[1..G.\textit{verts}] \gets \frac{1}{N}$
      \State $P.\textit{src} \gets \textit{seed}, P.\textit{delta} \gets \textit{delta}$
      \State $P.\textit{frontier.Insert}(G.\textit{nodes})$
    \EndProcedure

    \Procedure{DistributePPRValue}{$s\_\textit{id}, d\_\textit{id}, P$}
      \State $\textit{atomicAdd}(P.\textit{rank\_next}[d\_\textit{id}],\frac{P.\textit{rank\_curr}[s\_\textit{id}]}{P.\textit{out\_degree}[s\_\textit{id}]})$ \label{alg:ppr:distribute}
    \EndProcedure

    \Procedure{UpdatePPRValue}{$\textit{node}\_\textit{id}, P$}
      \State $P.\textit{rank}\_\textit{next}[\textit{node}\_\textit{id}] \gets (P.\textit{delta} \cdot P.\textit{rank}\_\textit{next}[\textit{node}\_\textit{id}]) + (P.\textit{src} == \textit{node}\_\textit{id}) ? (1.0 - P.\textit{delta}) : 0$ \label{alg:ppr:seed}
      \State $\textit{diff} \gets \textit{fabs}(P.\textit{rank\_next}[\textit{node\_id}]-P.\textit{rank\_curr}[\textit{node}\_\textit{id}])$
      \State \textbf{return} $\textit{diff} > P.\textit{threshold}$ \label{alg:ppr:filter}
    \EndProcedure

    \Procedure{Compute\_PPR}{$G,P,\textit{seed}, \textit{delta}, \textit{max\_iter}$}
      \State \Call {Set\_Problem\_Data}{$G,P,\textit{seed}, \textit{delta}$}
      \While {$P.\textit{frontier.Size}() > 0$}
        \State \Call {GraphTraversal}{$G,P,\textit{DistributePPRValue}$}
        \State \Call {Filtering}{$G,P,\textit{UpdatePPRValue}$}
        \State $swap(P.\textit{rank\_next}, P.\textit{rank\_curr})$
      \EndWhile
    \EndProcedure
  \end{algorithmic}
\end{algorithm}

\subsection{Twitter's Money Algorithm} Algorithm~\ref{alg:money} shows our implementation of Twitter's Money algorithm. We treat people in the CoT and people they follow as two disjoint sets $X$ and $Y$ in the bipartite graph $G = (X \cup Y, E)$ they form. For each node in $X$ (the CoT), the Money algorithm computes its similarity score; for each node in $Y$, the Money algorithm computes its relevance score. We use the following equations in our implementation:
\begin{equation}
  sim(x) = \left\{
  \begin{array}{l l}
    \alpha + (1-\alpha) \cdot \sum\limits_{(x,y) \in E} \frac{\textit{relevance}(y)}{d_\text{IN}(y)} & \text{$x$ \textrm{is seed}}\\
    (1-\alpha) \cdot \sum\limits_{(x,y) \in E} \frac{\textit{relevance}(y)}{d_\text{IN}(y)} & \text{otherwise}
  \end{array} \right.
\end{equation}
\begin{equation}
  \textit{relevance}(y) = \sum\limits_{(x,y)\in E} \frac{\textit{sim}(x)}{d_\text{OUT}(x)}
\end{equation}

In the Money algorithm, we also initialize problem data first (line~\ref{alg:money:problemdata}). We set the similarity score for the seed vertex to 1, and set the similarity and relevance scores for all other vertices to 0. We then put all vertices in the CoT in the initial frontier. The main loop of the algorithm contains two steps of GraphTraversal, for each iteration we use the first GraphTraversal to visit all the edges for each vertex in the CoT and distribute each vertex's similarity score to its neighbors' relevance scores (line~\ref{alg:money:relevance}). After this step, we update the relevance scores so that they may be used in the computation of similarity scores (line~\ref{alg:money:swaprel}). The second GraphTraversal will visit all the edges for each vertex in the CoT again, but this time it will distribute the neighbors' relevance scores back to each CoT vertex's similarity score (line~\ref{alg:money:similarity}). Again, we treat the seed vertex differently so that it can get more similarity score during each iteration, which in turn will affect its neighbors' relevance scores and other CoT vertices' similarity scores. After the second GraphTraversal step, we update the similarity scores (line~\ref{alg:money:swapsim}). As Twitter does in their Money algorithm~\cite{Goel:2014:TWT}, we end our main loop after $1/\alpha$ iterations.

\begin{algorithm}
  \caption{Twitter's Money Algorithm} \label{alg:money}
  \begin{algorithmic}[1]
    \Procedure{Set\_Problem\_Data}{$G, P, \textit{seed}, \textit{alpha}$}\label{alg:money:problemdata}
      \State $P.\textit{relevance\_curr}[1..G.\textit{verts}] \gets 0$
      \State $P.\textit{sim\_curr}[1..G.\textit{verts}] \gets 0$
      \State $P.\textit{src} \gets \textit{seed}, P.\textit{alpha} \gets \textit{alpha}$
      \State $P.\textit{sim\_curr}[\textit{seed}] \gets 1$
      \State $P.\textit{frontier.Insert}(G.\textit{cot\_queue})$
    \EndProcedure

    \Procedure{DistributeRelevance}{$s\_\textit{id}, d\_\textit{id}, P$}
      \State $atomicAdd(P.\textit{relevance\_next}[d\_\textit{id}],\frac{P.\textit{sim\_curr}[s\_\textit{id}]}{P.\textit{out\_degree}[s\_\textit{id}]})$ \label{alg:money:relevance}
    \EndProcedure

    \Procedure{DistributeSim}{$s\_id, d\_id, P$}
      \State $val \gets (1-P.\textit{alpha}) \cdot \frac{P.\textit{relevance\_curr}[d\_\textit{id}]}{P.\textit{in\_degree}[d\_\textit{id}]} + (P.\textit{src} == s\_\textit{id}) ? (\frac{P.\textit{alpha}}{P.\textit{out\_degree}[s\_\textit{id}]}) : 0$    \label{alg:money:similarity}
      \State $\textit{atomicAdd}(P.\textit{sim\_next}[d\_\textit{id}], \textit{val})$
    \EndProcedure

    \Procedure{Compute\_Money}{$G,P,\textit{seed}, \textit{alpha}$}
      \State \Call {Set\_Problem\_Data}{$G,P,\textit{seed}, \textit{alpha}$}
      \For {$iter\text{\raisebox{.4ex}{\tiny{++}}} < 1/\textit{alpha}$}
        \State \Call {GraphTraversal}{$G,P,\textit{DistributeRelevance}$}
        \State $swap(P.\textit{relevance\_next}, P.\textit{relevance\_curr})$ \label{alg:money:swaprel}
        \State \Call {GraphTraversal}{$G,P,\textit{DistributeSim}$}
        \State $swap(P.\textit{sim\_next}, P.\textit{sim\_curr})$ \label{alg:money:swapsim}
      \EndFor
    \EndProcedure
  \end{algorithmic}
\end{algorithm}

\section{Experiments}\label{sec:experiment}
We ran all experiments in this paper on a Linux workstation with 2 $\times$ 2.53~GHz Intel 4-core E5630 Xeon CPUs, 12~GB of main memory, and an NVIDIA K40c GPU with 12~GB on-board memory. The parallel programs were compiled with NVIDIA's nvcc compiler (version~6.0.1) with the -O3 flag. The sequential programs were compiled using gcc 4.6.3 with the -O3 flag. The datasets used in our experiments are shown in Table~\ref{tab:dataset}, and Table~\ref{tab:runtimes} shows the runtimes of our GPU recommendation system on these datasets. Runtimes are for GPU computation only and do not include CPU-GPU transfer time; we discuss why in Section~\ref{sec:discussion}. The wiki-Vote dataset is a real social graph dataset that contains voting data for Wikipedia administrators; all other datasets are follow graphs from Twitter and Google Plus~\cite{Leslovec:2014:SLN,Kwak:2010:WIT}. Twitter09 contains the complete Twitter follow graph as of 2009; we extract 75\% of its user size and 50\% of its social relation edge size to form a partial graph that can fit into the GPU's memory.

\begin{table}
  \centering
  \caption{Experimental Datasets\label{tab:dataset}}
  \begin{tabular}{*{6}{l}}
    \toprule Dataset & Vertices&Edges \\
    \midrule wiki-Vote & 7.1k & 103.7k \\
    twitter-SNAP & 81.3k & 2.4M \\
    gplus-SNAP & 107.6k & 30.5M \\
    twitter09 & 30.7M & 680M \\
    \bottomrule
  \end{tabular}
\end{table}

\subsection{Scalability}

\begin{table}
  \centering
  \caption{Runtimes for Different Graph Sizes\label{tab:runtimes}}
  \begin{tabular}{*{6}{r}}
    \toprule Time (ms) & wiki-Vote & twitter & gplus & twitter09 \\
    \midrule PPR & 0.45 & 0.84 & 4.74 & 832.69\\
    CoT & 0.54 & 1.28 & 2.11 & 51.61\\
    Money & 2.70 & 5.16 & 18.56 & 158.37\\
    Total & 4.37 & 8.36 & 26.57 & 1044.99\\
    \bottomrule
  \end{tabular}
\end{table}

In order to test the scalability of our WTF-on-GPU recommendation system, we ran WTF on six differently-sized subsets of the twitter09 dataset. The results are shown in Figure~\ref{fig:runtime-vs-edge-count}. We see that the implementation scales sublinearly with increasing graph size. As we double the graph size, the total runtime increases by an average of 1.684x, and the runtime for Money increases by an average of 1.454x. The reason lies in our work-efficient parallel implementation. By doing per-vertex computation exactly once and visiting each edge exactly once, our parallel algorithm performs linear $O(m+n)$ work. The reason that we have better scalability for the Money algorithm is that although we are doubling the graph size each time, the CoT size is fixed at 1000. We address scalability beyond a single GPU in Section~\ref{sec:discussion}.

\begin{figure}
  \centering
  \caption{Scalability of runtime versus edge count for our GPU recommendation system.\label{fig:runtime-vs-edge-count}}
\end{figure}

\subsection{Comparison to Cassovary}
We chose to use the Cassovary graph library for our CPU performance comparison. The results of this comparison are shown in Table~\ref{tab:cassovary}. Cassovary is a graph library developed at Twitter. It was the library Twitter used in their first WTF implementation~\cite{Gupta:2013:WTW}. The parameters and Cassovary function calls for this implementation can be found in the Appendix.
\begin{table*}
  \centering
  \caption{Runtime comparison to Cassovary.\label{tab:cassovary}}
  \begin{tabular}{*{10}{r}}
    \toprule & \multicolumn{2}{c}{wiki-Vote} & \multicolumn{2}{c}{twitter} & \multicolumn{2}{c}{gplus} & \multicolumn{2}{c}{twitter09} \\
    \midrule Step (runtime) & Cassovary & GPU & Cassovary & GPU & Cassovary & GPU & Cassovary & GPU\\
    \midrule PPR (ms) & 418 & 0.45 & 480 & 0.84 & 463 & 4.74 & 884 & 832.69\\
    CoT (ms) & 262 & 0.54 & 2173 & 1.28 & 25616 & 2.11 & 2192 & 51.61\\
    Money/SALSA (ms) & 357 & 2.70 & 543 & 5.16 & 2023 & 18.56 & 11216 & 158.37\\
    Total (ms) & 1037 & 4.37 & 3196 & 8.36 & 28102 & 26.57 & 14292 & 1044.99\\
    Speedup & \multicolumn{2}{r}{235.7} & \multicolumn{2}{r}{380.5} & \multicolumn{2}{r}{1056.5} & \multicolumn{2}{r}{13.7}\\
    \bottomrule
  \end{tabular}
\end{table*}

We achieve speedups of up to 1000x over Cassovary for the Google Plus graph, and a speedup of 14x for the 2009 Twitter graph, which is the most representative dataset for the WTF application. One difference between the GPU algorithm and the Cassovary algorithm is that we used the SALSA function that comes with the Cassovary library, instead of using Twitter's Money algorithm for the final step of the algorithm. Both are ranking algorithms based on link analysis of bipartite graphs, and in the original Who-To-Follow paper~\cite{Gupta:2013:WTW}, Gupta et al.\ use a form of SALSA for this step, so this is reasonable for a comparison.

\section{Discussion}
\label{sec:discussion} On a graph with more than 175 million nodes and 20 billion edges, the WTF algorithm currently takes around 500~ms in Twitter's data center~\cite{Myers:2014:INS}. In contrast, our implementation can process a graph with 25 million nodes and 340 million edges in a similar amount of time (524~ms), and it takes 1 second to process our largest dataset, which is still significantly smaller than the complete Twitter graph.

The Personalized PageRank calculation takes up the vast majority of the runtime for larger graphs (Table~\ref{tab:runtimes}). This is because PPR runs on the entire graph, but Money only runs on the pruned CoT graph, which does not grow as quickly. One possible way to reduce the runtime would be to precompute the CoT, and only run Money to update WTF\@. In this scheme, PPR would only be run periodically, and it could be an offline process. Alternatively, an incremental PPR calculation, as in Bahmani et al.~\cite{Bahmani:2010:FIP}, could provide an estimate of the new CoT without needing to iterate through the entire graph. With the precomputed CoT, we would only need to run Twitter's Money algorithm to get the result. According to our sublinear scalability model of Twitter's Money algorithm runtime, we could compute the result in 300~ms on Twitter's circa-2009 follow graph.

We also note that our PPR calculation uses a power iteration method, while Twitter's is a Monte Carlo method. Both methods are quite accurate, but there are a few trade-offs~\cite{Avrachenkov:2007:MCM}. The power iteration method is deterministic. Every time the algorithm runs on the same set of data, we will get the same results. Monte Carlo methods involve actually performing random walks to compute the probability distribution, so the outcome will not be exactly the same every time. In terms of performance, Monte Carlo methods give a reasonable approximation after only the first iteration, but the error decreases very slowly with more iterations. Power iteration, on the other hand, starts with a much more inaccurate estimation, but the error decreases and converges quicker than for Monte Carlo. We chose power iteration because it is both efficient and easy to implement on a GPU; however, it is possible that a Monte Carlo method would perform better on a GPU\@. Because Monte Carlo methods involve many independent walks, they are potentially well-suited to the GPU's massively parallel architecture.

The difference between Monte Carlo and iterative methods can be seen in the camparison with the Cassovary runtimes in Table~\ref{tab:cassovary}. While our GPU version is PPR-limited, the Cassovary implementation tends to be SALSA-limited. This is because of the difference in the way we compute Personalized PageRank. The Cassovary library PPR function uses the Monte Carlo random walks method, walking the graph and maintaining a visit count at each node, then ranking nodes by the number of visits. The number of steps is fixed, so the Cassovary PPR scales well. Our method iterates until convergence and computes the PPR for every node, not just ones that are in the vicinity of the start node.

One major limitation of our current implementation is the size of GPU memory. Today's GPUs have at most 12~GB of memory, whereas a CPU system can easily have ten times as much memory. This means that the entire current Twitter follow graph cannot fit in GPU memory. The twitter09 dataset is close to the maximum size that can fit on a single GPU\@. To compute WTF on the full graph, we would need to partition the graph and/or distribute it across multiple nodes. Scaling any large-scale parallel data analysis system, especially online social network system, beyond a single GPU remains a major challenge today. The unstructured and highly irregular connectivity of power-law graphs like social graphs makes it difficult to design partitioning and synchronization strategies for such graphs.

Another limitation is the bandwidth of the connection between main memory and GPU memory. Results in Section~\ref{sec:experiment} do not include the data transfer time from CPU memory to the GPU\@. For our largest graph,  the transfer time is 852.24~ms---about 85\% of the compute time. In our experiments, we assume that graph data will be resident on the GPU, because it will be used to run a variety of algorithms on the follow graph, so transfer time will not be significant overall; however, for a different use case or very frequent updates to the graph, data transfer time could seriously limit performance.

Fortunately, future GPU systems have potential solutions for the modest size of today's GPU memories. The next node on NVIDIA's GPU roadmap is ``Pascal'' (2016), which can be connected to the CPU via a high-speed ``NVLink'' connection that allows access to the CPU's main memory at CPU-main-memory speeds (as well as supporting unified virtual memory across CPU and GPU\@). While such systems will still require careful memory management, they eliminate the current performance disadvantage in the case data fits in CPU memory but not GPU memory.


\section{Conclusions and Future Work} In this work, we have shown that it is possible to use a GPU for recommendation algorithms on social graphs, but there are still many ways in which the performance could be improved. Software platforms for large-scale online social network analysis on hybrid CPU-GPU architectures could potentially offer better throughput and performance on systems that are more cost-effective than today's CPU-based cluster architectures. However, moving workloads to GPUs is challenging for the following reasons:

\begin{itemize}
\item Designing parallel algorithms for such systems is both difficult and time consuming.
\item Today, limited PCIe bandwidth is a constraint for using discrete GPUs for communication-bounded tasks.
\item The limited memory size on current GPUs makes it difficult to run algorithms on datasets that cannot fit in the GPU memory.
\end{itemize}

With the appearance of more graph processing libraries on the GPU, the design and implementation of online social network analysis software on the GPU is becoming easier and more efficient. Today, most online social network analysis algorithms are still compute-bound when running on large datasets. This makes a multi-CPU/multi-GPU architecture running across multiple nodes a promising solution.

We propose a two-layer framework running on a three-layer memory hierarchy where GPU memory serves as the fast cache, and CPU main memory serves as the second level cache, sitting atop data stored on hard disks/SSDs. In our case of building a recommendation system, the entire follow graph will be stored in CPU memory with disk as backing store. Multiple nodes, each containing one or more GPUs, will store the CoT for each user in the graph. Currently Twitter has 250 million users.  If we keep the size of CoT at 1000 and use unsigned integers for vertex IDs, then we will need 2~TB to store CoTs for all users. That can be easily partitioned by user ID to fit on 4 or more machines with 512~GB or less of main memory. In this case, we can run PPR as an offline algorithm that runs only once per day or after a certain number of graph updates. The recommendation system can then work in real time with around 100~ms running time. Because the number of vertices in CoT is a constant 1000, the pruned graph that contains only vertices in CoT and vertices they connect to will be much smaller than the original follow graph. This set-up would reduce both the computational workload and GPU memory requirements.

\bibliographystyle{abbrv} %\bibliography{wtf}
\begin{thebibliography}{10}

  \bibitem{Avrachenkov:2007:MCM}
  K.~Avrachenkov, N.~Litvak, D.~Nemirovsky, and N.~Osipova.
  \newblock {M}onte {C}arlo methods in {PageRank} computation: When one iteration is sufficient.
  \newblock {\em SIAM Journal of Numerical Analysis}, 45(2):890--904, Feb. 2007.

  \bibitem{Bahmani:2010:FIP}
  B.~Bahmani, A.~Chowdhury, and A.~Goel.
  \newblock Fast incremental and personalized {PageRank}.
  \newblock {\em Proceedings of the VLDB Endowment}, 4(3):173--184, Dec. 2010.

  \bibitem{Goel:2014:TWT}
  A.~Goel.
  \newblock The ``who-to-follow'' system at {T}witter: Algorithms, impact, and further research.
  \newblock WWW 2014 industry track, 2014.

  \bibitem{Gupta:2013:WTW}
  P.~Gupta, A.~Goel, J.~Lin, A.~Sharma, D.~Wang, and R.~Zadeh.
  \newblock {WTF}: The who to follow service at {T}witter.
  \newblock In {\em Proceedings of the International Conference on the World Wide Web}, pages 505--514, May 2013.

  \bibitem{Kleinberg:1999:ASH}
  J.~M. Kleinberg.
  \newblock Authoritative sources in a hyperlinked environment.
  \newblock {\em Journal of the ACM}, 46(5):604--632, Sept. 1999.

  \bibitem{Kwak:2010:WIT}
  H.~Kwak, C.~Lee, H.~Park, and S.~Moon.
  \newblock {W}hat is {T}witter, a social network or a news media?
  \newblock In {\em Proceedings of the International Conference on the World Wide Web}, pages 591--600, Apr. 2010.

  \bibitem{Lempel:2001:STS}
  R.~Lempel and S.~Moran.
  \newblock {SALSA}: The stochastic approach for link-structure analysis.
  \newblock {\em ACM Transactions on Information Systems}, 19(2):131--160, Apr. 2001.

  \bibitem{Leslovec:2014:SLN}
  J.~Leskovec.
  \newblock {SNAP}: Stanford large network dataset collection.
  \newblock \url{http://snap.stanford.edu/data/}.
  \newblock Accessed: 2014-05-18.

  \bibitem{Merrill:2012:SGG}
  D.~Merrill, M.~Garland, and A.~Grimshaw.
  \newblock Scalable {GPU} graph traversal.
  \newblock In {\em Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming}, PPoPP '12, pages 117--128, Feb. 2012.

  \bibitem{Myers:2014:INS}
  S.~A. Myers, A.~Sharma, P.~Gupta, and J.~Lin.
  \newblock Information network or social network?: The structure of the {T}witter follow graph.
  \newblock In {\em Proceedings of the Companion Publication of the International Conference on the World Wide Web}, WWW Companion '14, pages 493--498, Apr. 2014.

\b  ibitem{Page:1999:TPC}
  L.~Page, S.~Brin, R.~Motwani, and T.~Winograd.
  \newblock The {PageRank} citation ranking: Bringing order to the web.
  \newblock Technical Report 1999-66, Stanford InfoLab, Nov. 1999.

  \bibitem{Rungsawang:2012:FPC}
  A.~Rungsawang and B.~Manaskasemsak.
  \newblock Fast {PageRank} computation on a {GPU} cluster.
  \newblock In {\em Proceedings of the 2012 20th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)}, pages 450--456, Feb. 2012.

  \bibitem{Srinivasa:2013:GIO}
  K.~G. Srinivasa, K.~Mishra, C.~S. Prajeeth, and A.~M. Talha.
  \newblock {GPU} implementation of friend recommendation system using {CUDA} for social networking services.
  \newblock In {\em Proceedings of the International Conference on Emerging Research in Computing, Information, Communication, and Applications}, pages  890--895, Aug. 2013.

  \bibitem{Wu:2010:EPS}
  T.~Wu, B.~Wang, Y.~Shan, F.~Yan, Y.~Wang, and N.~Xu.
  \newblock Efficient {PageRank} and {SpMV} computation on {AMD} {GPU}s.
  \newblock In {\em Proceedings of the 39th International Conference on Parallel   Processing}, pages 81--89, Sept. 2010.

\end{thebibliography}

\appendix
Here are the parameters and function calls used in our comparison using the Cassovary graph library. Our inputs for the personalized PageRank computation are as follows:

\begin{small}
\begin{verbatim}
val numSteps = 100L * 1000L
val resetProb = 0.15
val maxSteps = None
val pathsSaved = Some(2)
val walkParams = RandomWalkParams(numSteps, resetProb, maxSteps, pathsSaved)
val graphUtils = new GraphUtils(graph)
val (topNeighbors, paths) = graphUtils.calculatePersonalizedReputation(startNode, walkParams)
\end{verbatim}
\end{small}

\noindent
Parameters for the SALSA calculation:

\begin{small}
\begin{verbatim}
val leftResetProb = 0.2
val rightResetProb = 0
val numTopContributors = 5
val SALSA = new IterativeLinkAnalyzer(bipartiteGraphUtils, leftResetProb, rightResetProb, numTopContributors)
val numIterations = 5
val (topSimilarities, topRelevances) =
  SALSA.analyze(leftNodeInfo, numIterations, {LHSNodes => LHSNodes.neighborIds(OutDir)})
\end{verbatim}
\end{small}

\end{document}
